Add Output Area crosswalk and geographic assignment (Phase 1)#291
Conversation
Port the US-side clone-and-prune calibration methodology to the UK, starting with Output Area (OA) level geographic infrastructure:

- Build unified UK OA crosswalk from ONS, NRS, and NISRA data (235K areas: 189K E+W OAs + 46K Scotland OAs)
- Population-weighted OA assignment with country constraints
- Constituency collision avoidance for cloned records
- Tests validating crosswalk completeness and assignment correctness

This is Phase 1 of a 6-phase pipeline to enable OA-level calibration, analogous to the US Census Block approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi Vahid,
Most of this is from our boy Claude, as usual. This looks like a great setup! Can't wait to see HHs getting donated to the OAs! I'll approve, but please see the issues Claude found below.
Here's the code I used to poke around:
```python
from policyengine_uk_data.calibration.oa_crosswalk import load_oa_crosswalk

xw = load_oa_crosswalk()
xw

# Population-weighted sampling demo
import numpy as np

xw["population"] = xw["population"].astype(float)  # workaround for bug 1 below
eng = xw[xw["country"] == "England"].copy()
eng["prob"] = eng["population"] / eng["population"].sum()
rng = np.random.default_rng(42)
idx = rng.choice(len(eng), size=10_000, p=eng["prob"].values)
sampled = eng.iloc[idx]
sampled.groupby("oa_code")["population"].agg(["count", "first"]).rename(
    columns={"count": "times_sampled", "first": "population"}
).sort_values("times_sampled", ascending=False).head(20)
```
leads to:
```
Out[1]:
           times_sampled  population
oa_code
E00179944              5      3354.0
E00035641              3       279.0
E00039569              3       263.0
E00066618              3       331.0
E00115325              2       319.0
E00136307              2       301.0
E00089585              2       333.0
E00167257              2       472.0
E00130843              2       406.0
E00021422              2       190.0
E00004742              2       313.0
E00044937              2       294.0
E00089725              2       240.0
E00044974              2       400.0
E00160095              2       401.0
E00016512              2       305.0
E00016490              2       380.0
E00089915              2       514.0
E00021502              2       396.0
E00105618              2       305.0
```
Interesting: "E00179944 with population 3,354 is a massive outlier (most OAs are 100–300 people)"
Bugs
1. load_oa_crosswalk loads population as string
load_oa_crosswalk() passes dtype=str for all columns (line 753 of oa_crosswalk.py), so population comes back as a string. This means any downstream arithmetic (e.g. computing probabilities) fails with TypeError: unsupported operand type(s) for /: 'str' and 'str'. Should either drop dtype=str or explicitly cast population to int on load.
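A minimal sketch of the suggested fix, assuming `load_oa_crosswalk` reads the gzipped CSV with pandas (the column names here match the crosswalk output; the exact reader code in `oa_crosswalk.py` may differ). Codes stay as strings to preserve any leading zeros, while `population` is cast explicitly:

```python
import io
import pandas as pd

# Stand-in for storage/oa_crosswalk.csv.gz contents (illustrative rows only).
csv = io.StringIO("oa_code,country,population\nE00000001,England,301\n")

# Keep dtype=str for code columns, then cast population after load.
xw = pd.read_csv(csv, dtype=str)
xw["population"] = xw["population"].astype(int)

assert xw["population"].dtype.kind == "i"  # arithmetic now works downstream
```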
2. NI households silently get no assignment
The crosswalk has 0 NI rows (NISRA 404), which is acknowledged, but assign_random_geography will silently produce None entries for NI households (country code 4). Worth either raising an error or logging a warning when a household's country has no distribution.
Code quality
3. Dead code in _assign_regions
Lines 602–606 of oa_crosswalk.py:
```python
for k, v in la_to_region.items():
    if k[:3] == la_code[:3]:
        # Same LA type prefix
        pass
```

This loop does nothing — should be removed or finished.
4. Assignment inner loop should be vectorised
In oa_assignment.py lines 236–245, the for i, pos in enumerate(positions) loop storing results can be replaced with vectorised numpy indexing:
```python
oa_codes[start + positions] = dist["oa_codes"][indices]
```

Same for all the other arrays. Will matter when `n_clones * n_records` gets large.
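For concreteness, a self-contained toy version of the substitution (array names follow the review; the data is made up):

```python
import numpy as np

oa_codes = np.empty(8, dtype=object)                # output array, all countries
dist_oa_codes = np.array(["E001", "E002", "E003"])  # this country's OAs
start = 2                                  # offset of this country's block
positions = np.array([0, 1, 2, 3])         # record offsets within the block
indices = np.array([2, 0, 0, 1])           # sampled distribution rows

# Loop version the review flags:
#   for i, pos in enumerate(positions):
#       oa_codes[start + pos] = dist_oa_codes[indices[i]]
# Vectorised equivalent: one fancy-indexed assignment.
oa_codes[start + positions] = dist_oa_codes[indices]
```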
Worth noting
5. Scotland population weighting is effectively uniform
The fallback of ~117 per OA for all 46k Scottish OAs means population-weighted sampling is actually uniform for Scotland. This undermines the premise for ~20% of UK OAs. Might be worth a louder warning or a TODO to revisit once NRS fixes the 403.
baogorek
left a comment
Approving Phase 1 — the crosswalk and assignment engine look good. Please see my comment above for a few things to address before merge.
Background
This PR implements Phase 1 of a 6-phase pipeline to enable Output Area (OA) level calibration — the UK equivalent of the US Census Block approach.
Why are we doing this?
The US pipeline (`policyengine-us-data`) uses a clone-and-prune approach that produces much finer geographic granularity than our current UK methodology. This PR takes the UK down to Output Area level (~235K OAs across the UK), the equivalent of the US Census Block, and is the first step.
What this PR does (Phase 1: OA Crosswalk & Geographic Assignment)
1. Unified UK Output Area Crosswalk
Downloads and combines geographic lookups from three national statistics agencies into a single crosswalk:
```
OA → LSOA/DataZone → MSOA/IntermediateZone → LA → Constituency → Region → Country
```
Data sources:
Output: `storage/oa_crosswalk.csv.gz` (1.4MB compressed) — 235,243 areas, 65M population, 632 constituencies, 363 LAs, 11 regions
2. Geographic Assignment Engine
Assigns population-weighted random Output Areas to cloned FRS household records, with two key constraints:
3. Tests — 19 passing, 1 skipped (NI)
Validates crosswalk completeness (OA counts, population totals, hierarchy nesting, country prefixes) and assignment correctness (country constraints, collision avoidance, population-weighted sampling, save/load roundtrip).
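To illustrate the core behaviour the assignment engine and tests are checking — population-weighted sampling restricted to the household's country — here is a rough sketch. All names, codes, and populations are made up and do not reflect the actual `oa_assignment.py` API:

```python
import numpy as np

# Toy crosswalk: codes, countries, and populations are illustrative only.
oa_codes = np.array(["E001", "E002", "S001", "S002"])
countries = np.array(["England", "England", "Scotland", "Scotland"])
population = np.array([300.0, 900.0, 100.0, 100.0])
rng = np.random.default_rng(0)

def sample_oas(country, n):
    mask = countries == country                        # country constraint
    probs = population[mask] / population[mask].sum()  # population weighting
    return rng.choice(oa_codes[mask], size=n, p=probs)

draws = sample_oas("England", 5)
assert set(draws) <= {"E001", "E002"}  # a household never crosses the border
```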
Known limitations
What comes next (Phases 2-6)
Phase 2: Clone-and-Assign
Clone each FRS household N times (start with N=10), assign each clone a different OA. Insert into `create_datasets.py` after imputations, before calibration.
US ref: PRs #457, #531
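A minimal sketch of the clone step planned above, assuming a numpy-style repeat of household records (array names are hypothetical):

```python
import numpy as np

N = 10  # clones per household, the starting value from the plan
household_ids = np.array([101, 102, 103])

cloned_ids = np.repeat(household_ids, N)       # each record duplicated N times
clone_index = np.tile(np.arange(N), len(household_ids))  # 0..N-1 per household

assert len(cloned_ids) == len(household_ids) * N
```

Each clone would then receive its own OA draw from the Phase 1 assignment engine.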
Phase 3: L0 Calibration Engine
Port L0-regularized optimization from US side. HardConcrete gates to actively drop records, producing sparse datasets. Add `l0-python` dependency.
US ref: PRs #364, #365
Phase 4: Sparse Matrix Builder
Build sparse `(n_targets × n_records*n_clones)` calibration matrix. Simulate PolicyEngine-UK per clone, wire existing `targets/sources/` into sparse matrix rows.
US ref: PRs #456, #489
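As a shape-only illustration of what Phase 4 would build — a sparse matrix with one row per calibration target and one column per record-clone — here is a toy `scipy.sparse` example (dimensions and values are made up; the real matrix would be filled from PolicyEngine-UK simulations):

```python
import numpy as np
from scipy import sparse

# Toy dimensions: 3 targets x 8 record-clones (the real matrix is far larger).
rows = np.array([0, 0, 1, 2])            # target index
cols = np.array([1, 5, 2, 7])            # record-clone index
vals = np.array([1.0, 1.0, 250.0, 1.0])  # e.g. counts or money amounts

M = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 8))
assert M.nnz == 4  # only nonzero entries are stored
```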
Phase 5: SQLite Target Database
Hierarchical target storage: UK → Country → Region → LA → Constituency → MSOA → LSOA → OA. Migrate existing CSV/Excel targets into SQLite.
US ref: PRs #398, #488
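An illustrative `sqlite3` sketch of hierarchical target storage; the actual schema is not specified in this PR, so the table and column names below are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE targets (
        geo_level TEXT,  -- UK / Country / Region / LA / Constituency / MSOA / LSOA / OA
        geo_code  TEXT,
        metric    TEXT,
        value     REAL
    )
    """
)
con.execute(
    "INSERT INTO targets VALUES ('OA', 'E00000001', 'population', 301.0)"
)
rows = con.execute(
    "SELECT value FROM targets WHERE geo_level = 'OA'"
).fetchall()
```

A single table keyed by geography level keeps migration from the existing CSV/Excel targets straightforward.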
Phase 6: Local Area Publishing
Generate per-area H5 files from sparse weights. Modal integration for scale.
US ref: PR #465
File summary